Thanks to my Data Science Professor Hector Corida Bravo and his notes. These helped me immensily for this project.
A popular source of argument among friends, enemies, frenemies, and family members (which could be in any of the previous categories) is the state of gun ownership and gun violence in the United States today. I know I have (cordially of course) discussed this issue a fair bit, especially after the Vegas and Parkland shootings. A common train of disagreement that tends to come from these conversations is the issue of regulation. Some people associated with a particular political leaning argue that more regulation leads to fewer guns which leads to less gun violence. Other people with the opposite political inclination argue that the regulations would only be followed by people who wouldn’t be committing gun crimes anyway, and the regulations would just give wrongdoers more incentive to do wrong without the fear of an armed bystander. However, rather than presenting unbiased information to the public, news sources have decided to join one side of the argument or the other and sensationalize the issues to make more money. On the regulation-is-good side of the argument (henceforth referred to as pro-regulation), there is this article and this more scholarly article from Stanford. On the other, regulation-is-bad side of the argument (henceforth the anti-regulation stance) is this article. Because, as mentioned before, news is sensationalist and we can’t believe anything they say, we will do the data ourselves. We will be getting our information on gun crime from this Kaggle data set and cross referencing it with this data set on gun laws. We will play with weights based on issues that matter to us using this supporting table. Finally, we will adjust our gun crime rates for population using a dataset from the census bureau. After we store this data in R and make it pretty and clean, we will visualize it with multiple charts, perform some regression work, and even use ML to predict future gun violence rates. Of course, I don’t know where this data came from exactly, it could come from people who are as bad as the news sources. Also, we must understand that there are several factors at work in an issue like this one, and just because a correlation was found doesn’t mean there is causation.
In terms of hypothesis testing, our null hypothesis is that there is no relation between gun violence and gun regulation, and our alternative hypothesis is that there is a relation between gun violence and gun regulation. More general hypothesis test information can be found here.
As a final thought, I wrote the preceding without doing any data analysis, so I will be just as surprised by the results of these procedures when I finish running them as you will be when you read this for the first time!
We will be doing this project in the language R in the RStudio. You can download RStudio here. If you want to get some more background information on R and RStudio, visit this page, which has links to various resources for familiarizing yourself with R. I will try to be very thorough here too, however.
Once you have R installed and have familiarized yourself with it to the extent you desire, sign up for a Kaggle account (I used my burner Google account and it was quite easy) and download the Kaggle data set and the data set on gun laws as CSV files. The Kaggle will download as a zipped folder with a CSV in it, you have to choose CSV from a list of options for the other data. The gun laws will default to only 2017, use the tools to the left to select years 2013-2017, then click the ‘CSV’ button. Click on this link to download the census dataset. You can open these files in Microsoft Excell or another spreadsheet program to inspect the data or look at the raw CSVs in a text editor to get a feel for what R will actually be seeing.
Once you have RStudio open, got to File->New Project and create a directory for this project. This is what I will call the project’s home directory, and all associated files for the project should be put in this folder.
We will load in the CSV for the Kaggle gun violence data first. If you haven’t already put the CSV in the project’s home directory, do so.
#This library has everything and is fantastic, I've used it for
#every R project I've done
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts -------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#Pulls the data form our CSV and stores it as R's
#data structure, a data frame
violence_full <- read.csv("gun-violence-data_01-2013_03-2018.csv")
#this pipeline selects the first five colums and the
#first six rows of the dataset and prints.
violence_full %>% select(1:5)%>%head()
## incident_id date state city_or_county
## 1 461105 2013-01-01 Pennsylvania Mckeesport
## 2 460726 2013-01-01 California Hawthorne
## 3 478855 2013-01-01 Ohio Lorain
## 4 478925 2013-01-05 Colorado Aurora
## 5 478959 2013-01-07 North Carolina Greensboro
## 6 478948 2013-01-07 Oklahoma Tulsa
## address
## 1 1506 Versailles Avenue and Coursin Street
## 2 13500 block of Cerise Avenue
## 3 1776 East 28th Street
## 4 16000 block of East Ithaca Place
## 5 307 Mourning Dove Terrace
## 6 6000 block of South Owasso
As a note, all code will be commented with a max of two lines, and if I want to expound on my notes I will do it after the code block.
For example, the last line is what is called a pipeline. The ‘%>%’ operator takes the result of the thing in front of it and passes it as the first argument of the thing behind it. These pipes usually start with a dataset, in this case violence_full. Select picks columns, and this range means columns 1-5 inclusive. Head selects the first few rows of a dataset. If a line does nothing but create a value or mention a value without storing it, that value is printed.
Also, if you have not already installed a library, go to the packages tab (by default in the bottom right pane in RStudio), click the “install” button, type in the name of the package you are looking for, and click “Install.”
Now we will add the data set on gun laws and the census dataset. The government wasn’t kind enough to provide us with a CSV, so we will have to do a slightly different procedure for the .xlsx file they do provide.
#Pulls the data form our CSV and stores it as R's
#data structure, a data frame
gunlaws_full <- read.csv("raw_data.csv")
#this pipeline selects the first five colums and the
#first six rows of the dataset and prints.
gunlaws_full %>% select(1:5)%>%head()
## state year age18longgunpossess age18longgunsale age21handgunpossess
## 1 Alabama 2013 0 0 0
## 2 Alaska 2013 0 1 0
## 3 Arizona 2013 1 0 0
## 4 Arkansas 2013 0 0 0
## 5 California 2013 0 1 0
## 6 Colorado 2013 0 0 0
#library for reading .xlsx
library(xlsx)
## Loading required package: rJava
## Loading required package: xlsxjars
#the second argument in read.xlsx is the sheet index
#it can also be the sheet name
pops_full <- read.xlsx("nst-est2017-01.xlsx", 1)
pops_full %>% select(1:5)%>%head()
## table.with.row.headers.in.column.A.and.column.headers.in.rows.3.through.4...leading.dots.indicate.sub.parts.
## 1 Table 1. Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2017
## 2 Geographic Area
## 3 <NA>
## 4 United States
## 5 Northeast
## 6 Midwest
## NA. NA..1 NA..2 NA..3
## 1 <NA> <NA> <NA> NA
## 2 40269 <NA> Population Estimate (as of July 1) NA
## 3 Census Estimates Base 2010 2011
## 4 308745538 308758105 309338421 311644280
## 5 55317240 55318350 55388349 55642659
## 6 66927001 66929794 66973360 67141501
You will note that the dataframe for pops_full has a lot of junk in it. If you open it and the CSV for gun laws in Excell and compare them, you will notice that the .xlsx contains a lot more data than the gun laws CSV, and that most of this data will not be needed. We’ll deal with that in a bit.
If you want more information about reading in weird datafiles, check out this link.
If you look at the website for the categories, you’ll notice that there is no option to download the data. We will have to scrape it. As outlined in that link, we will have to tell R what sort of element it is looking for to collect the data from. If you don’t want to use the chrome extension the link talks about, you will have to inspect the element and select the class that that element is part of. You will put that in the argument of html_nodes. This will return a list of matches; you will have to select the one you want. In my case I wanted the first match it found.
#the library for scraping
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
#read in webpage.
#usually a web address
webpage <- read_html("State Firearm Laws - Use the Database Glossary Search Tool.html")
cats_full <- webpage %>%
html_nodes(".table") %>%
.[[1]] %>%
html_table(fill = TRUE)
head(cats_full)
## Code
## 1 age18longgunpossess
## 2 age18longgunsale
## 3 age21handgunpossess
## 4 age21handgunsale
## 5 age21longgunpossess
## 6 age21longgunsale
## Definition
## 1 No possession of long guns until age 18
## 2 Purchase of long guns from licensed dealers and private sellers restricted to age 18 and older
## 3 No possession of handguns until age 21
## 4 Purchase of handguns from licensed dealers and private sellers restricted to age 21 and older
## 5 No possession of long guns until age 21
## 6 Purchase of long guns from licensed dealers and private sellers restricted to age 21 and older
## Category/Subcategory
## 1 Possession regulationsAge restrictions
## 2 Buyer regulationsAge restrictions
## 3 Possession regulationsAge restrictions
## 4 Buyer regulationsAge restrictions
## 5 Possession regulationsAge restrictions
## 6 Buyer regulationsAge restrictions
One snag I ran into here is that the page is dynamically loaded, so it only worked when I right clicked on the page and selected “Save page as…” and put the result in my project’s home direcory.
As mentioned before, there is a lot of junk in the pops_full dataset. We only want the data and done of the fluff. I’ll remind you what this dataset looks like:
head(pops_full)
## table.with.row.headers.in.column.A.and.column.headers.in.rows.3.through.4...leading.dots.indicate.sub.parts.
## 1 Table 1. Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2017
## 2 Geographic Area
## 3 <NA>
## 4 United States
## 5 Northeast
## 6 Midwest
## NA. NA..1 NA..2 NA..3
## 1 <NA> <NA> <NA> NA
## 2 40269 <NA> Population Estimate (as of July 1) NA
## 3 Census Estimates Base 2010 2011
## 4 308745538 308758105 309338421 311644280
## 5 55317240 55318350 55388349 55642659
## 6 66927001 66929794 66973360 67141501
## NA..4 NA..5 NA..6 NA..7 NA..8 NA..9
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 2012 2013 2014 2015 2016 2017
## 4 313993272 316234505 318622525 321039839 323405935 325719178
## 5 55860261 56047732 56203078 56296628 56359360 56470581
## 6 67318295 67534451 67720120 67839187 67978168 68179351
You’ll notice that we only want a few of the columns (called atributes): the ones with the state names and the dates. You will also notice that there are some extra rows (called entities). The following code trims them down.
#library for text manipulation
library(stringr)
#the arrow is a storage operator
pops_trimmed <- pops_full %>%
#removes attributes 2 and 3
select(-(2:3)) %>%
#removes first 8 entities
slice(-(1:8)) %>%
#removes last 6 entities
slice(-(52:58) ) %>%
#change the column names (temporarily for ease of use)
#the ticks let us use code we wouldn't usually put in a pipeline
`colnames<-`(c("State", "2010", "2011", "2012", "2013",
"2014", "2015", "2016", "2017")) %>%
#remove the first character in the string for State
mutate(State = str_sub(State, 2, 25)) %>%
#change 2010 (ticks because its a number) to numeric data
transform(`2010` = as.double(as.character(`2010`))) %>%
#change the column names (again...)
#try without this to see what happens.
`colnames<-`(c("State", "2010", "2011", "2012", "2013",
"2014", "2015", "2016", "2017"))
head(pops_trimmed)
## State 2010 2011 2012 2013 2014 2015
## 1 Alabama 4785579 4798649 4813946 4827660 4840037 4850858
## 2 Alaska 714015 722259 730825 736760 736759 737979
## 3 Arizona 6407002 6465488 6544211 6616124 6706435 6802262
## 4 Arkansas 2921737 2938640 2949208 2956780 2964800 2975626
## 5 California 37327690 37672654 38019006 38347383 38701278 39032444
## 6 Colorado 5048029 5116411 5186330 5262556 5342311 5440445
## 2016 2017
## 1 4860545 4874747
## 2 741522 739795
## 3 6908642 7016270
## 4 2988231 3004279
## 5 39296476 39536653
## 6 5530105 5607154
You will notice that I changed the state names to remove the periods, changed the attribute names, and changed the 2010 attribute to neumaric (helpful later, when we want them to be numbers: it was treating it as text). For more information on string manipulation, check this link.
However, we’re not done. We want to be able to join the violence dataset to this based on year and state eventually. This dataset as it stands is not very condusive to this, it would be better if there was a separate attribute for each state in each year. This also plays into the idea of tidy data (incidentally, this seems to be the paper Professor Bravo used for his class notes–might want to cite this Professor).
#gather(dataset, what attribute names turns into, what the data turns into,
# what not to include)
pops_tidy <- gather(pops_trimmed, year, population, -State)
head(pops_tidy)
## State year population
## 1 Alabama 2010 4785579
## 2 Alaska 2010 714015
## 3 Arizona 2010 6407002
## 4 Arkansas 2010 2921737
## 5 California 2010 37327690
## 6 Colorado 2010 5048029
Also, the violence_full dataset has more attributes than we need, and the date attribute is not seen by R as a date. To use it effectivly as a date, we would have to “date” attribute to a date attribute. This involves some fanciness, and we really don’t need the full date for this project, so we won’t bother. We will create a year attribute for use when joining with the population data for analysis instead.
violence_trimmed <- violence_full %>%
#select the 4 attributes we want
select(date, state, longitude, latitude) %>%
mutate(date = str_sub(date, 1, 4))
head(violence_trimmed)
## date state longitude latitude
## 1 2013 Pennsylvania -79.8559 40.3467
## 2 2013 California -118.3330 33.9090
## 3 2013 Ohio -82.1377 41.4455
## 4 2013 Colorado -104.8020 39.6518
## 5 2013 North Carolina -79.9569 36.1140
## 6 2013 Oklahoma -95.9768 36.2405
This dataset is really cool and I regret chopping out this much data. I am sure someone particularly motivated could do some really cool stuff with all the data this thing has. If you haven’t looked more closely at it yet, do so!
Because we are taking so long to get to our final data, and because we now have a hyper cool dataset with latitude and longitude, we are going to make cool maps! Hooray!
One thing about this is, as a tangent, I will not be going into extreme detail about this section.
Here, we will make an ultra-cool layered map, with a layer per year of data, showing the location of the reports of gun violence.
#the cool library for making maps
library(leaflet)
## Warning: package 'leaflet' was built under R version 3.4.4
#get rid of entities with NA values
violence_map <- na.omit(violence_trimmed)
#start map pipeline
markermap <- leaflet(violence_map) %>%
addTiles() %>%
#get longitute of one year
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2013)%>%select(longitude)))},
#same for latitude
~{as.numeric(unlist(violence_map%>%filter(date==2013)%>%select(latitude)))},
radius = 2,
color = 'red',
#assign this layer an identifier
group="2013")%>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2014)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2014)%>%select(latitude)))},
radius = 2,
color = 'orange',
group="2014") %>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2015)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2015)%>%select(latitude)))},
radius = 2,
color = 'green',
group="2015") %>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2016)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2016)%>%select(latitude)))},
radius = 2,
color = 'blue',
group="2016") %>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2017)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2017)%>%select(latitude)))},
radius = 2,
color = 'purple',
group="2017") %>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2018)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2018)%>%select(latitude)))},
radius = 2,
color = 'black',
group="2018") %>%
addLayersControl(
#gather all the layers togeather.
overlayGroups = c("2013", "2014", "2015", "2016", "2017", "2018"),
options = layersControlOptions(collapsed = FALSE)
) %>%
#start with only the 2013 layer showing.
hideGroup("2014") %>%
hideGroup("2015") %>%
hideGroup("2016") %>%
hideGroup("2017") %>%
hideGroup("2018")
#show it!
markermap
I know that this probably isn’t the most efficient way to write it, but it works…barely. This map is really slow. I recommend when manipulating it you turn off all the layers, turning them back on when the view is where you would like it to be. Usually one would add labels to the data, but it is slow enough as it is I’m not going to push it. Seriously, when trying to make this, I had to restart RStudio about 10 times as I made it less and less fancy (I started with awesome markers, labels, and the like). It just goes to show how big this dataset is.
The map for 2013 is relativly tame. There are a lot of dots, especially in urban areas, and it was less than what I expected. However, when you turn on the filter for any other year, the map just exploads. It blew my mind how many firearm-related incidents were recorded in this dataset. Even for year 2018, which isn’t done yet, has a ton of datapoints. It is cool to zoom to your hometown or an urban center and turn the filters on one by one and see how much of this is going on. As I alluded to in the last paragraph, as flippant as I may be in the introduction paragraphs, this is a really serious issue that needs to be approached with great thought, but more importaintly, action to go with it.
Relating to this project, it makes me wonder if a lot of data is missing for 2013. The Kaggle site reports that this data has “all recorded gun violence incidents in the US between January 2013 and March 2018, inclusive,” but the organization it got its data form only has maps since 2014 and this questionable site shows no dramatic increase between 2013 and 2014. Therefore, I will only use 2014-2107, the complete years, for analysis.
You can find more about the leaflet here.
[coming soon]
Here, we are going to finally answer the question of whether or not regulation has an impact on gun violence (in our somewhat flawed, generalized way). We are, unfortunatly, going to have to modify some data sets some more. gunlaws_full already has a total_laws column, so we don’t have to do any more processing on that beyond slicing out all of the middle rows. Population is already mostly ready from the work above. Turns out, though, that we have to remove DC from population because the gun laws dataset doesn’t include DC and remove all years except 2014-2017 inclusive becuase of the violence dataset.
However, our incident database is not finished. It is a list of incidents, we want it to become a list of all of the incidents that happen in a certain state in a certain year. This website demonstrates how to do various aggegations and tallies. A very interesting but seemingly still accurate website has some good tips for how to aggregate data like this for an old library that has been updated, but I like the name of the site so I left it in anyways.
#count separate occurences of each pair of date and state
violence_summ <- violence_trimmed %>%
count(date, state)
#select only the good data, unfortunatly have to remove DC...
violence_summ <- violence_summ %>% filter(2014 <= date & date <= 2017 & state != "District of Columbia")
head(violence_summ)
## # A tibble: 6 x 3
## date state n
## <chr> <fct> <int>
## 1 2014 Alabama 1318
## 2 2014 Alaska 146
## 3 2014 Arizona 556
## 4 2014 Arkansas 572
## 5 2014 California 3732
## 6 2014 Colorado 556
I should have mentioned before what the c() function does! It takes its arguments and makes a vector out of them. You saw this before when I was renaming the columns in the population sheet and when grouping the layers for the map.
Now that this data is in the right format, we must join the population and law counts to it. A join takes two datasets and combines them on rows that are the same. In this example, each dataset has an entity with a unique date and state. Because all of our datasets will have rows that correspond exactly, it is not neccessary to pick a specific type of join, but the types are discussed in detail here.
First though, fix up the laws and the population datasets for the final time…
gunlaws_trimmed <- gunlaws_full %>%
transform(year = as.character(year)) %>%
select(state, year, lawtotal) %>%
filter(2014 <= year & year <= 2017 & state != "District of Columbia")
pops_tidy_trimmed <- pops_tidy %>%
filter(2014 <= year & year <= 2017 & State != "District of Columbia")
combined <- violence_summ %>%
#join on two conditions
left_join(gunlaws_trimmed, by = c("state" = "state", "date" = "year")) %>%
#join again on two conditions
left_join(pops_tidy_trimmed, by = c("state" = "State", "date" = "year")) %>%
#get a incidents_per_person number
mutate(per_pop = n/population)
## Warning: Column `state` joining factors with different levels, coercing to
## character vector
head(combined)
## # A tibble: 6 x 6
## date state n lawtotal population per_pop
## <chr> <chr> <int> <int> <dbl> <dbl>
## 1 2014 Alabama 1318 10 4840037 0.000272
## 2 2014 Alaska 146 4 736759 0.000198
## 3 2014 Arizona 556 11 6706435 0.0000829
## 4 2014 Arkansas 572 11 2964800 0.000193
## 5 2014 California 3732 100 38701278 0.0000964
## 6 2014 Colorado 556 30 5342311 0.000104
Now that we finally have a good dataset, we need to make sure the preconditions to linear regression are met. We will call lawtotal the independant variable as its increase is argued by some to decrease firearm incidents, we will call it the independant variable and the per-population incidence rate the dependant variable.
Let us look at our graph first (using ggplot: here is some mind-numbing information on it):
combined %>%
#assign vars
ggplot(aes(x=lawtotal, y=per_pop)) +
#make labels
labs(x = "Number of Regulation Laws", y = "Incidents per Capita", "title" = "Fire Arm Regulation vs. Firearm Incidents") +
#make a dotplot
geom_point()
This data seems like it might slightly decrease with the number of regulation laws, but this is not very clear cut. One way to try and make data more linear is to take the log of the dependant variable. We will try that.
combined %>%
#assign vars
ggplot(aes(x=lawtotal, y=log(per_pop))) +
#make labels
labs(x = "Number of Regulation Laws", y = "Log Incidents per Capita", "title" = "Fire Arm Regulation vs. Log Firearm Incidents") +
#make a dotplot
geom_point()
This has centered the data in the graph, but not made it much better. In fact, I would say that it has been made worse.
Reverting back to the original, not-log’d data, we will generate some more graphs to try and see if these conditions are met. First, we have to make a regression model and then feed it into the plot function.
#lm's formula is response[dependant]~terms[indepenent]
fit = lm(per_pop~lawtotal, data=combined)
#the which part selects which plots to show
plot(fit, which=1:2)
Also, we can use some functions from the broom library to get some stats on the regression.
library(broom)
glance(fit)
## r.squared adj.r.squared sigma statistic p.value df logLik
## 1 0.02851464 0.02360814 8.895388e-05 5.811614 0.01683451 2 1582.696
## AIC BIC deviance df.residual
## 1 -3159.392 -3149.497 1.566733e-06 198
tidy(fit)
## term estimate std.error statistic p.value
## 1 (Intercept) 2.096411e-04 9.073062e-06 23.105876 4.082934e-58
## 2 lawtotal -5.844784e-07 2.424488e-07 -2.410729 1.683451e-02
After looking at this output and compairing it to the expectations of a good fit set out here, I would say that this is not a good regression line. The residuals in the “Residuals vs. Fitted” graph are all clustered to one side; with a good regression line, there would be no decernable pattern. The Normal Q-Q plot seems to show a curve: a good regression will produce normally distributed resids, which will appear as a horizontal line on that plot. The \(R^2\) value is the percentage of the dependent variables that can be explained by the dependent. In this case, it is about 3%, which is very low. A good regression would have an \(R^2\) value of about 70%. The p-value of the slope is reasonably small (about .017), but I am still wary due to the other factors mentioned.
Even though this line is not very good, we can still plot it. The dark line is the regression itself, the light blue around it is a 95% confidence interval for the line.
combined %>%
#assign vars
ggplot(aes(x=lawtotal, y=per_pop)) +
#make labels
labs(x = "Number of Regulation Laws", y = "Incidents per Capita", "title" = "Fire Arm Regulation vs. Firearm Incidents") +
#make a dotplot
geom_point()+
#add prediction
stat_smooth(method="lm")
We see it is decreasing slightly.
However, due to the inconsistancies with the requirements for linear regression listed above, I do not believe that linear regression is the best choice. As far as regression goes, linear was about all that was discussed in class, so I am going to shrug my shoulders and give up on regression. I will though show you what R thinks is the best regression for this data:
combined %>%
#assign vars
ggplot(aes(x=lawtotal, y=per_pop)) +
#make labels
labs(x = "Number of Regulation Laws", y = "Incidents per Capita", "title" = "Fire Arm Regulation vs. Firearm Incidents") +
#make a dotplot
geom_point()+
#add prediction
geom_smooth()
## `geom_smooth()` using method = 'loess'
See how this line is all wavy? That relationship is not very linear.